Actuarial Data Science - Open Learning Resource
Some coding examples in this lecture are adapted from R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
In practice, a large part of an actuary’s work involves cleaning and reshaping data to ensure that it is trustworthy and easy to analyse.
In this lecture, we introduce a small set of powerful verbs (filter(), select(), mutate(), summarise(), group_by(), and joins) that can be combined to express many common data-wrangling tasks clearly.
dplyr (a core member of tidyverse) for data manipulation and transformation
dplyr overwrites some functions in base R (e.g. filter(), lag()). To use the base versions (after loading dplyr), specify stats::filter() and stats::lag().nycflights13 package, which contains data on flights departing from New York City in 2013ggplot2 to help us explore and understand the data.int: integersdbl: doubles (real numbers)chr: character vectors (strings)dttm: date-times (a date + time)lgl: logical vectors (TRUE or FALSE)fctr: factors (categorical variables with fixed levels)date: dates.These are the common variable types used in tidyverse data frames (tibble)
Functions in the dplyr package:
%>%: pipe operator, used to chain multiple operations togetherglimpse(): a glimpse into the data and its structurefilter(): select observations (rows) that satisfy given conditionsarrange(): reorder rows based on variable valuesselect(): choose a subset of variables (columns)mutate(): create new variables or transform existing onessummarise(): reduce multiple values to a single summarygroup_by(): group data so operations are performed within each groupTo use filter() effectively, you need to know how to select observations using comparison operators. R provides the standard set:
> greater than>= greater than or equal to< less than<= less than or equal to!= not equal to== equal toNote: be cautious when using == with floating point numbers. Consider using near() instead.
filter() are combined with “and”: every condition must be true for a row to be included in the output.Other logical operators:
&: “and”|: “or”!: “not”x %in% y select rows where x is one of the values in yAccording to De Morgan’s laws:
!(x & y) is equivalent to !x | !y!(x | y) is equivalent to !x & !yQuestion:
Find all flights that departed in November or December.
Question:
Find flights that were not delayed (on arrival or departure) by more than two hours.
NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also return NA.filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, you need to include them explicitly:arrange(): order rows by column names (or more complex expressions)desc() to sort a column in descending orderselect(): select a subset of variables (columns) by namestarts_with("abc"): select variables whose names begin with “abc”ends_with("xyz"): select variables whose names end with “xyz”contains("ijk"): select variables whose names contain “ijk”.matches(): select variables that match a regular expression.everything(): select all variablesOther related functions:
rename(): rename variablesmutate(): create new variables from existing variablesPlease have a try!
# Step 1: Group flights by destination
by_dest <- group_by(flights, dest)
# Step 2: Summarise to compute statistics for each destination
delay_summary <- summarise(by_dest,
count = n(), # Number of flights
dist = mean(distance, na.rm = TRUE), # Average distance
delay = mean(arr_delay, na.rm = TRUE) # Average arrival delay
)
# Step 3: Filter to remove noisy points and outliers
delay_summary <- filter(delay_summary,
count > 20, # Keep destinations with >20 flights
dest != "HNL") # Remove Honolulu (outlier: very far)
# Create visualization: relationship between distance and delay
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
delay_plot <- ggplot(data = delay_summary,
mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) + # Point size represents flight count
geom_smooth(se = FALSE) # Add smooth trend line
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'%>%# Same analysis using pipe operator for cleaner, more readable code
delays_summary <- flights %>%
group_by(dest) %>% # Group by destination
summarise(
count = n(), # Number of flights per destination
dist = mean(distance, na.rm = TRUE), # Average distance
delay = mean(arr_delay, na.rm = TRUE) # Average arrival delay
) %>%
filter(count > 20, dest != "HNL") # Filter: >20 flights, exclude Honoluluna.rm=TRUE: removes missing valuesmean(x), median(x)sd(x), IQR(x)min(x), quantile(x, 0.25), max(x)first(x), nth(x, 2), last(x)n(), sum(!is.na(x)), n_distinct(x)
count(tailnum, wt = distance): “count” (sum) the total number of miles a plane flewsum(x > 10), mean(y == 0)Note:
The interquartile range (IQR) is a measure of variability based on dividing a data set into quartiles. Q1 is the middle value in the lower half of the rank-ordered data. Q2 is the median value in the dataset. Q3 is the middle value in the upper half of the rank-ordered data. The interquartile range is equal to Q3 − Q1.
Three families of verbs designed to work with relational data:
Mutating joins: add new variables to one data frame from matching observations in another
Filtering joins: filter observations from one data frame based on whether they match observations in another table
Set operations: treat observations as if they were set elements.
A similar database system: SQL
nycflights13 contains five related tibbles:
flights: information about each flightairlines: full carrier namesairports: information about each airport, identified by the faa airport codeplanes: information about each plane, identified by its tailnumweather: weather data at each NYC airport for each hourRelations of Tibbles (Source: Wickham and Grolemund (2017), R for Data Science)
planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.flights$tailnum is a foreign key because it links each flight to a plane in the planes table.origin is part of the primary key in the weather table, and is also a foreign key that links to the airports table.count() the primary keys and look for entries where n is greater than 1.flights table?
mutate() and row_number()flightsflights %>%
arrange(year, month, day, sched_dep_time, carrier, flight) %>%
mutate(flight_id = row_number()) %>%
#This makes it possible to see every column in a data frame.
glimpse()Rows: 336,776
Columns: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ dep_time <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, ~
$ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, ~
$ dep_delay <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -~
$ arr_time <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,~
$ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,~
$ arr_delay <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,~
$ carrier <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "~
$ flight <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79~
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708~
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",~
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",~
$ air_time <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N~
$ distance <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005~
$ hour <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ minute <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
$ flight_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
y to x:
inner_join(): keep observations that appear in both tablesleft_join(): keep all observations in xright_join(): keep all observations in yfull_join(): keep all observations in both x and yby=NULL: uses all variables that appear in both tables (a natural join)by = "x": uses a common variable named “x”by = c("a" = "b"): match variable a in table x to variable b in table ybase::merge()| dplyr | merge |
|---|---|
inner_join(x, y) |
merge(x, y) |
left_join(x, y) |
merge(x, y, all.x = TRUE) |
right_join(x, y) |
merge(x, y, all.y = TRUE) |
full_join(x, y) |
merge(x, y, all.x = TRUE, all.y = TRUE) |
dplyr joins are generally faster and preserve the order of rows.semi_join(x, y): keep all observations in x that have a match in yanti_join(x, y): drop all observations in x that have a match in yQuestion:
flights# Step 1: Find top 10 most popular destinations
top_dest <- flights %>%
count(dest, sort = TRUE) %>% # Count flights per destination, sort descending
head(10) # Keep only top 10
# Step 2: Filter flights to only those going to top destinations
flights %>%
semi_join(top_dest) # Keep only flights matching top destinationsQuestion:
When connecting flights and planes, which flights do not have a match in in planes?
anti_join().intersect(x, y): return observations that appear in both x and yunion(x, y): return all unique observations in x and ysetdiff(x, y): return observations in x but not in y